Tree Structured Multimedia Signal Modeling

Authors

Weicheng Ma

Kai Cao

Xiang Li

Peter Chin

Abstract

Current solutions to multimedia modeling tasks rely on sequential models and static tree-structured models. Sequential models, especially those based on Bidirectional LSTM (BLSTM) and Multilayer LSTM networks, have been widely applied to video, sound, music and text corpora. Despite achieving state-of-the-art results on several multimedia processing tasks, sequential models always fail to emphasize short-term dependency relations, which are crucial in most sequential multimedia data. Tree-structured models are able to overcome this defect. The static tree-structured LSTM presented by Tai et al. (Tai, Socher, and Manning 2015) forcibly breaks the dependencies between elements inside each semantic group and those outside the group, while preserving chain dependencies among semantic groups and among nodes in the same group. Although the tree-LSTM network represents the dependency structure of multimedia data better, it requires the dependency relations of the input to be known before the input is fed into the network. This is hard to achieve because, for most types of multimedia data, no parser exists that can detect the dependency structure of every input sequence accurately enough. To preserve dependency information without requiring a perfect parser, in this paper we present a novel neural network architecture which 1) is self-expandable and 2) maintains the layered dependency structure of incoming multimedia data. We call this architecture the Seq2Tree network. A Seq2Tree model is applicable to classification, prediction and generation tasks with task-specific adjustments of the model. We show by experiments that our Seq2Tree model performs well on all three types of tasks.

Figure 1: A tree-structured model with three layers. Hidden states of lower-level nodes are inherited from parent nodes.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org).
All rights reserved.

Introduction

Multimedia signal modeling is the basis of multimedia signal processing tasks across a wide range of research fields such as Computer Vision, Natural Language Processing, and Sound Signal Processing. However, research on sequential data remains underdeveloped because of the flexible structure of such data. In this paper, we aim to tackle the problems that static neural network solutions have in modeling sequential signals, using a dynamically self-adjustable neural network architecture.

In sequential data each unit contributes to the prediction of all the units that follow it, so traditional sequential models such as the Hidden Markov Model (HMM) and Recurrent Neural Networks (RNN) are the first choice when processing this kind of data. Because it mitigates the vanishing and exploding gradient problems, the Long Short-Term Memory (LSTM) network has been very popular in sequential data processing tasks. The Seq2seq network (Sutskever, Vinyals, and Le 2014), for example, one of the best-known applications of the LSTM network, has attracted much attention in machine translation research (Luong et al. 2015). Its use has also been extended to other tasks such as speech-to-text conversion (Zhang, Chan, and Jaitly 2017). Moreover, HMMs achieve high performance in music style classification, especially when differentiating composer characteristics (Buzzanca 2002; Chai and Vercoe 2001).

However, simple sequential models over single data points can sometimes misrepresent complex multimedia signals. Thus, some variants of the sequential models have been introduced. The Bidirectional LSTM (BLSTM) network (Schuster and Paliwal 1997), for example, combines two LSTM networks, one reading the input forwards and the other backwards. Because of the backward LSTM, BLSTM is able to foresee possible boundaries of unit patterns in the future.
The current state-of-the-art results in the speech and noise separation task, as reported in the 2nd CHiME challenge (Vincent et al. 2013), are achieved by a BLSTM-based system (Erdogan et al. 2015). Multilayer LSTM is another popular variant of the original LSTM network. This architecture allows different features to sit on different layers of the network, so that a sequence can be divided into patterns based on learnt boundary characteristics. As an example, a three-layer Seq2seq model achieves over 95% accuracy in the evaluation of constituent parsers (Vinyals et al. 2015). Nevertheless, both BLSTM and Multilayer LSTM are additive combinations of multiple original LSTM networks, which easily lose geometric information about the units. This undermines the performance of systems based on BLSTM and Multilayer LSTM.

After careful examination of multiple multimedia data samples, we found that most multimedia signals share two characteristics: 1) dependencies within each meaningful group of units are stronger than those across groups, and 2) meaningful groups form a chain-structured dependency path. These characteristics of temporally successive multimedia data lead us naturally to a tree-structured representation which 1) expands along one direction, 2) branches only when a pattern starts, and 3) ends a branch and continues expanding at a higher level when it reaches the end of the current pattern. This special tree structure meets our need to model multimedia signals in terms of meaningful segments rather than single units, by placing the units of the same semantic group in the same subtree. We call these meaningful unit groups segments. Currently, no existing model can bound segments of flexible length. To build such a tree structure from sequential input, we exploit the ability of our self-expandable tree model to find the boundaries of segments by branching at proper positions. We call this novel neural network architecture the Seq2Tree network.
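The three structural rules above (expand in one direction, branch when a pattern starts, close the branch and continue one level up when the pattern ends) can be illustrated with a small sketch. This is not the Seq2Tree network itself: in the paper the model learns where to branch, whereas here the boundary counts `starts` and `ends` are assumed to be given, and the names `Segment`, `build_tree` and `to_nested` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Union


@dataclass
class Segment:
    """One subtree: the units (or nested segments) of a single pattern."""
    children: List[Union[str, "Segment"]] = field(default_factory=list)


def build_tree(units: Sequence[str],
               starts: Sequence[int],
               ends: Sequence[int]) -> Segment:
    """Group a flat sequence into nested segments.

    starts[i]: number of segments that open just before unit i.
    ends[i]:   number of segments that close right after unit i.
    ASSUMPTION: both are supplied by the caller; a Seq2Tree model
    would have to decide these boundaries itself.
    """
    root = Segment()
    stack = [root]  # path from the root to the currently open segment
    for unit, n_open, n_close in zip(units, starts, ends):
        for _ in range(n_open):       # rule 2: branch when a pattern starts
            child = Segment()
            stack[-1].children.append(child)
            stack.append(child)
        stack[-1].children.append(unit)  # rule 1: expand along one direction
        for _ in range(n_close):      # rule 3: end the branch, go one level up
            if len(stack) == 1:
                raise ValueError("segment closed before it was opened")
            stack.pop()
    if len(stack) != 1:
        raise ValueError("unclosed segment at end of sequence")
    return root


def to_nested(node: Segment) -> list:
    """Convert a Segment tree into plain nested lists for inspection."""
    return [to_nested(c) if isinstance(c, Segment) else c
            for c in node.children]


units = ["the", "cat", "sat", "on", "the", "mat"]
starts = [1, 0, 1, 1, 0, 0]
ends = [0, 1, 0, 0, 0, 2]
print(to_nested(build_tree(units, starts, ends)))
# prints [['the', 'cat'], ['sat', ['on', 'the', 'mat']]]
```

Units of the same segment end up in the same subtree, and the top-level segments form the chain that the text describes.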
For generality, we fit our tree model to three different types of tasks, namely classification, prediction and generation, with the necessary modifications to the network. We designed experiments to demonstrate the correctness of the tree-structured models built by the Seq2Tree network and their advantage over traditional LSTM-based models. Seq2Tree networks were introduced by Ma et al. (2018). The structure has also been used in AI tasks such as signal processing (Ma et al. 2017).

Multimedia Signal Modeling

Among all types of multimedia signals we concentrate on temporally successive signals. More specifically, in this paper we focus on text, video, music and sound signals. We mainly study three types of tasks: classification, prediction and generation.

Classification Model

Figure 2: A tree-structured classification model. Information from every top-level node is summarized into the top node. The hidden state of the top node can be fed into a classifier.

As shown in Figure 2, the classification model has one root node above the entire tree structure. The root node incorporates the hidden states of all top-level nodes in the original tree structure. We build a softmax classifier by adding a softmax layer on top of the root node:

$$p(y \mid h_{top}) = \mathrm{softmax}\left(U^{(c)} h_{top} + b\right), \qquad y_{top} = \arg\max_{y} \, p(y \mid h_{top})$$

where $U^{(c)}$ is the classification matrix, $b$ is the bias and $h_{top}$ is the hidden state at the top of the tree. The cost function we choose here is the cross-entropy loss of the predicted label y:
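The classification head described above can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy values of `U`, `b` and `h_top`, and the use of the standard negative log-likelihood form of cross-entropy for a single example are all assumptions, since the paper's own cost expression is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(U: Sequence[Sequence[float]],
             b: Sequence[float],
             h_top: Sequence[float]) -> Tuple[List[float], int]:
    """Compute p(y | h_top) = softmax(U h_top + b) and its argmax."""
    scores = [sum(u * h for u, h in zip(row, h_top)) + b_k
              for row, b_k in zip(U, b)]
    probs = softmax(scores)
    y_top = max(range(len(probs)), key=probs.__getitem__)
    return probs, y_top


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """ASSUMED standard form: negative log-probability of the true label."""
    return -math.log(probs[target])


# Toy values (hypothetical): 3 classes, 2-dimensional root hidden state.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
h_top = [0.5, -0.5]

probs, y_top = classify(U, b, h_top)
print(y_top)                      # index of the most probable class
print(cross_entropy(probs, 0))    # loss if the true label is class 0
```

With these toy values the scores are 0.5, -0.5 and 0.0, so class 0 is predicted and the loss is smallest when class 0 is the true label.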

Similar resources

Sound Signal Processing with Seq2Tree Network

Long Short-Term Memory (LSTM) and its variants have been the standard solution to sequential data processing tasks because of their ability to preserve previous information weighted on distance. This feature provides the LSTM family with additional information in predictions, compared to regular Recurrent Neural Networks (RNNs) and Bag-of-Words (BOW) models. In other words, LSTM networks assume...

Minimum error rate training for designing tree-structured probability density function

In this paper, we propose a signal prototype classification and evaluation framework in acoustic modeling. Based on this framework, a new tree-structured likelihood function is derived. It uses a designated cluster kernel f m for signal prototype classification and a designated cluster kernel f m for likelihood evaluation of outlier or tail events of the cluster. A minimum classification error (MC...

Acoi: A System for Indexing Multimedia Objects

In this paper, we present a system that combines independent feature detector programs with multimedia database technology to provide a semantic rich index to multimedia data items on the World Wide Web. First, we introduce a grammatical framework, called feature grammars, which forms the indexing schema. Feature grammars are an extension of context-free grammars with active symbols (e.g. multi...

Structured Noncommutative Multidimensional Linear Systems and Scale-Recursive Modeling

Recently, the multiscale signal and image processing community has recognized that a suitable model for multiresolution processes is a model with time-like variable indexed by the nodes on a homogeneous tree with different depths in the tree corresponding to different spatial scales associated with the signal or image. It turns out that these system models are close relatives of the Structured ...

A universal k-tree model for content-based multimedia retrieval

In this paper, we propose a unified model for indexing and retrieving multimedia data using characteristic features and multiresolution processing. Each multimedia datatype, such as audio and video, is represented as a k-dimensional signal in the spatio-temporal domain. A k-dimensional signal is transformed into characteristic features (contents) and these features are stored in a hierarchical ...

Publication date: 2018
